Introduction

The Project

The Problem Description

This project examines my exercise activities, recorded with a smartwatch and uploaded to the Strava app. The goals are to predict what type of activity was recorded and how many calories were burned, using all other variables as predictors. I will first look at each variable to understand its shape and scale. I will then perform a regression analysis predicting calories burned, using linear and lasso regression and other methods as needed. I will also perform a classification analysis predicting whether an activity was a run or a ride, using logistic regression and classification trees. I will close with a conclusion that identifies my best models and describes the best predictors of both the continuous and the qualitative response variables.

The Data

This dataset has 497 rows and 9 variables.

Data Sources

This dataset was recorded with a smartwatch and the Strava app. It combines biometric and GPS data that I recorded on runs and rides.

The Data

VARIABLES TO PREDICT WITH

  • Moving Time (Minutes): The total time, measured in minutes, that I was actually moving. This does not include time spent waiting at an intersection, or for the runner’s dog to go #2.
  • Elapsed Time (Minutes): The total time, measured in minutes, that the activity was being recorded.
  • Average Heart Rate: My average heart rate, in beats per minute, over the course of the activity, according to my smartwatch.
  • Weather Temperature: Temperature in degrees Fahrenheit, supplied by The Weather Channel API for the location where the run or ride occurred.
  • Season: Winter, Spring, Summer, or Fall. This was originally the activity date, which I converted to a season.
  • Max Heart Rate: The highest heart rate measured during the activity.
  • Distance_K: The distance covered during the activity, in kilometers.

VARIABLES WE WANT TO PREDICT

  • Activity Type: Activity is either a Run (1) or a Bike Ride (0)
  • Calories_Burned_Estimated: The number of calories burned according to Strava’s estimates

Data Exploration

View the Data Summaries

Summaries of each of our variables are below. A clearer breakdown of Season is at the bottom.
 Activity_Type   Distance_K     Elapsed_Time_Minutes Max_Heart_Rate 
 1:436         Min.   : 1.170   Min.   :  3.95       Min.   : 89.0  
 0: 61         1st Qu.: 4.860   1st Qu.: 25.73       1st Qu.:178.0  
               Median : 5.200   Median : 29.87       Median :182.0  
               Mean   : 7.572   Mean   : 37.68       Mean   :179.8  
               3rd Qu.: 9.830   3rd Qu.: 46.25       3rd Qu.:185.0  
               Max.   :81.730   Max.   :329.08       Max.   :196.0  
    Season          Average_Heart_Rate Calories_Burned_Estimated
 Length:497         Min.   : 75.22     Min.   :  23.72          
 Class :character   1st Qu.:150.78     1st Qu.: 449.00          
 Mode  :character   Median :157.52     Median : 504.09          
                    Mean   :154.81     Mean   : 602.40          
                    3rd Qu.:162.46     3rd Qu.: 792.21          
                    Max.   :177.03     Max.   :2825.00          
 Moving_Time_Minutes Weather_Temperature
 Min.   :  3.95      Min.   : 10.15     
 1st Qu.: 21.67      1st Qu.: 49.35     
 Median : 24.20      Median : 62.91     
 Mean   : 31.38      Mean   : 62.73     
 3rd Qu.: 39.87      3rd Qu.: 75.40     
 Max.   :217.08      Max.   :100.51     

Count of Activity by Season

Most of our activities occurred in the Summer and Spring seasons.
| Season | Count | freq  |
|:-------|------:|------:|
| Summer |   196 | 0.394 |
| Spring |   132 | 0.266 |
| Winter |    87 | 0.175 |
| Fall   |    82 | 0.165 |

Data Visualization

Response Variables relationships with predictors

  • The vast majority of our activities were Runs rather than Rides.

  • The histogram of calories burned is not normal; it has a long right tail. The largest concentration of activities is just below 500 calories. It might look more normal if we looked at runs and rides separately.

Activity Type

Calories Burned

Calories Burned vs Season

Calories Burned vs Continuous Variables

Activity Type vs Continuous Variables

Activity Type vs Categorical Variables

Initial Models

Predicting Calories Burned

Here is a look at a regression model predicting Calories Burned.
| term                 | estimate | std.error | statistic | p.value |
|:---------------------|---------:|----------:|----------:|--------:|
| (Intercept)          | -399.673 |   108.169 |    -3.695 |   0.000 |
| Distance_K           |  -34.960 |     2.139 |   -16.340 |   0.000 |
| Elapsed_Time_Minutes |   -2.813 |     0.565 |    -4.977 |   0.000 |
| Max_Heart_Rate       |    2.076 |     0.896 |     2.316 |   0.021 |
| SeasonSpring         |   46.660 |    18.156 |     2.570 |   0.010 |
| SeasonSummer         |  -16.192 |    19.008 |    -0.852 |   0.395 |
| SeasonWinter         |   -1.563 |    19.949 |    -0.078 |   0.938 |
| Average_Heart_Rate   |    0.779 |     0.816 |     0.955 |   0.340 |
| Moving_Time_Minutes  |   27.538 |     0.990 |    27.827 |   0.000 |
| Weather_Temperature  |    0.146 |     0.425 |     0.343 |   0.732 |

| .metric | .estimator | .estimate |
|:--------|:-----------|----------:|
| rmse    | standard   |   123.878 |
| rsq     | standard   |     0.863 |
| mae     | standard   |    83.484 |

Predicting Activity Type

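Here is a look at a logistic regression model predicting Activity Type. Every predictor has an extremely high p-value and an enormous standard error, yet the model classifies every activity correctly (accuracy, specificity, and sensitivity are all 1). This pattern is typical of complete separation: the predictors split runs from rides perfectly, so the coefficient estimates are unstable and their p-values are not informative.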
| term                 | estimate | std.error | statistic | p.value |
|:---------------------|---------:|----------:|----------:|--------:|
| (Intercept)          |   63.643 | 83694.593 |     0.001 |   0.999 |
| Distance_K           |   35.598 | 14015.355 |     0.003 |   0.998 |
| Elapsed_Time_Minutes |    0.183 |  1201.187 |     0.000 |   1.000 |
| Max_Heart_Rate       |   -0.207 |  1342.719 |     0.000 |   1.000 |
| SeasonSpring         |  -17.671 | 40896.706 |     0.000 |   1.000 |
| SeasonSummer         |   -6.158 | 36926.171 |     0.000 |   1.000 |
| SeasonWinter         |   -1.402 | 29123.394 |     0.000 |   1.000 |
| Average_Heart_Rate   |   -0.292 |  1253.322 |     0.000 |   1.000 |
| Moving_Time_Minutes  |  -10.655 |  2560.028 |    -0.004 |   0.997 |
| Weather_Temperature  |    0.375 |   416.741 |     0.001 |   0.999 |

| .metric     | .estimator | .estimate |
|:------------|:-----------|----------:|
| accuracy    | binary     |         1 |
| specificity | binary     |         1 |
| sensitivity | binary     |         1 |

Further Data Exploration

Interactive exploratory graphs

Max Heart Rate & Calories Burned Scatter Plot

Lighter colors indicate longer distances.

Average Distance by Activity Type (Interactive Bar Chart)

---
title: "Project Part Two Dashboard"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: scroll
    source_code: embed
    theme: yeti
---

```{r setup, include=FALSE, warning=FALSE}
# include=FALSE keeps this chunk's code and output out of the dashboard
# warning=FALSE suppresses warnings from this chunk

library(GGally)        # v2.1.2, ggpairs()
library(ggcorrplot)    # v0.1.3, ggcorrplot()
library(flexdashboard) # v0.5.2
library(plotly)        # v4.10.1, plot_ly(), ggplotly()
library(crosstalk)     # v1.2.0
library(tidymodels)    # also attaches dplyr (%>%, mutate(), group_by(), summarize()) and ggplot2
library(readr)         # read_csv()
```

```{r load_data}
# Load the data and recode Activity_Type as a factor: 1 = Run, 0 = Ride.
# Listing level 1 first makes Run the first factor level.
df <- read_csv("Strava_df2.csv") %>%
  mutate(Activity_Type = factor(ifelse(Activity_Type == "Run", 1, 0),
                                levels = c(1, 0)))
```

Introduction {data-orientation=rows}
=======================================================================

Row {data-height=600}
-----------------------------------------------------------------------
### The Project

#### The Problem Description
This project examines my exercise activities, recorded with a smartwatch and uploaded to the Strava app. The goals are to predict what type of activity was recorded and how many calories were burned, using all other variables as predictors. I will first look at each variable to understand its shape and scale. I will then perform a regression analysis predicting calories burned, using linear and lasso regression and other methods as needed. I will also perform a classification analysis predicting whether an activity was a run or a ride, using logistic regression and classification trees. I will close with a conclusion that identifies my best models and describes the best predictors of both the continuous and the qualitative response variables.

#### The Data
This dataset has 497 rows and 9 variables. 

#### Data Sources
This dataset was recorded with a smartwatch and the Strava app. It combines biometric and GPS data that I recorded on runs and rides.

### The Data
VARIABLES TO PREDICT WITH

* **Moving Time (Minutes)**: The total time, measured in minutes, that I was actually moving. This does not include time spent waiting at an intersection, or for the runner’s dog to go #2.
* **Elapsed Time (Minutes)**: The total time, measured in minutes, that the activity was being recorded.
* **Average Heart Rate**: My average heart rate, in beats per minute, over the course of the activity, according to my smartwatch.
* **Weather Temperature**: Temperature in degrees Fahrenheit, supplied by The Weather Channel API for the location where the run or ride occurred.
* **Season**: Winter, Spring, Summer, or Fall. This was originally the activity date, which I converted to a season.
* **Max Heart Rate**: The highest heart rate measured during the activity.
* **Distance_K**: The distance covered during the activity, in kilometers.


VARIABLES WE WANT TO PREDICT

* **Activity Type**:  Activity is either a Run (1) or a Bike Ride (0)
* **Calories_Burned_Estimated**: The number of calories burned according to Strava’s estimates 
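
The Season predictor listed above was derived from the activity date. The exact rule used for that conversion is not recorded here, so the following is only a sketch, assuming meteorological seasons in the Northern Hemisphere (December through February is Winter, and so on). `season_from_date` is a hypothetical helper name, not a function from the original analysis.

```r
# Hypothetical helper: map a Date to a season label.
# Assumption: meteorological seasons, Northern Hemisphere.
season_from_date <- function(date) {
  month <- as.integer(format(as.Date(date), "%m"))
  # Month-indexed lookup: position 1 = January, ..., position 12 = December
  lookup <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
              "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
  lookup[month]
}

# Example dates chosen for illustration only
season_from_date(as.Date(c("2023-01-15", "2023-04-02", "2023-07-04", "2023-10-31")))
```

A different convention (astronomical seasons, or a Southern Hemisphere location) would only require changing the lookup vector.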

Data Exploration {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=200}
-------------------------------------

### Data Overview 
Calories burned has a very wide range, from about 24 to 2,825. Activity Type is mostly made up of 1s (Runs): 436 of the 497 activities. The table at the bottom of the page shows that most activities occurred in the Spring and Summer.

Column {data-width=450 data-height=600}
-----------------------------------------------------------------------
### View the Data Summaries
Summaries of each of our variables are below. A clearer breakdown of Season is at the bottom.
```{r, cache=TRUE}
#View data
summary(df)
```

Column {data-width=150 data-height=300}
-----------------------------------------------------------------------

### Count of Activity by Season
Most of our activities occurred in the Summer and Spring seasons.
```{r, cache=TRUE}
#Summary table for Season variable
knitr::kable(df %>%
  group_by(Season) %>%
  summarize(Count=n()) %>%
  mutate(freq = round(Count / sum(Count), 3)) %>% 
  arrange(desc(freq)))
```

Data Visualization {data-orientation=rows}
=======================================================================
### Response Variables relationships with predictors

* The vast majority of our activities were Runs rather than Rides.

* The histogram of calories burned is not normal; it has a long right tail. The largest concentration of activities is just below 500 calories. It might look more normal if we looked at runs and rides separately.
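
One way to check whether calories look more normal within each activity type is to draw one histogram panel per type. The sketch below uses simulated stand-in data (not the real Strava activities), so the shapes it produces say nothing about the actual result; the column names mirror the dataset, and `sim` is a made-up object name.

```r
library(ggplot2)

# Simulated stand-in for the Strava data (NOT the real activities):
# many shorter "runs" and a few longer "rides".
set.seed(1)
sim <- data.frame(
  Activity_Type = factor(rep(c("Run", "Ride"), times = c(200, 30))),
  Calories_Burned_Estimated = c(rnorm(200, mean = 500, sd = 120),
                                rnorm(30, mean = 1200, sd = 400))
)

# One histogram panel per activity type; free y-axes so the smaller
# group is not flattened by the larger one.
p <- ggplot(sim, aes(Calories_Burned_Estimated)) +
  geom_histogram(bins = 20) +
  facet_wrap(~Activity_Type, scales = "free_y")
p
```

Replacing `sim` with `df` should give the same view for the recorded activities, with panels labelled 1 (Run) and 0 (Ride).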

row {data-height=550}
-----------------------------------------------------------------------
#### Activity Type

```{r, cache=TRUE}
ggplot(df,aes(x=Activity_Type)) + geom_bar()
```

#### Calories Burned
```{r, cache=TRUE}
ggplot(df, aes(Calories_Burned_Estimated)) + geom_histogram(bins=20)
```


Row {.tabset data-height=450}
-----------------------------------------------------------------------
### Calories Burned vs Season
```{r, cache=TRUE}
ggpairs(dplyr::select(df,Calories_Burned_Estimated,Season))
```

###  Calories Burned vs Continuous Variables


```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(df,Calories_Burned_Estimated,Elapsed_Time_Minutes,Max_Heart_Rate,Average_Heart_Rate,Distance_K,Moving_Time_Minutes,Weather_Temperature)))
```

### Activity Type vs Continuous Variables
```{r, cache=TRUE}
ggpairs(dplyr::select(df,Activity_Type,Elapsed_Time_Minutes,Max_Heart_Rate,Average_Heart_Rate,Distance_K,Moving_Time_Minutes,Weather_Temperature))
```


### Activity Type vs Categorical Variables
```{r, cache=TRUE}
df %>% group_by(Season, Activity_Type) %>%
  summarize(n=n()) %>%
  ggplot(aes(y=n, x=Activity_Type,fill=Season)) +
      geom_bar(position="dodge", stat="identity") +
      geom_text(aes(label=n), position=position_dodge(width=0.9), vjust=-0.25) +
      ggtitle("Activity Type vs Season") +
      coord_flip() #makes horizontal
```

Initial Models {data-orientation=rows}
=======================================================================
### Predicting Calories Burned

Here is a look at a regression model predicting Calories Burned.
```{r}

reg_spec <- linear_reg() %>% ## Class of problem  
   set_engine("lm") %>% ## The particular function that we use  
   set_mode("regression") ## type of model
#Fit the model
reg_fit <- reg_spec %>%  
   fit(Calories_Burned_Estimated ~ .-Activity_Type,data = df)
#Capture the predictions and metrics
pred_reg_fit <- augment(reg_fit, df)
knitr::kable(tidy(reg_fit$fit),
             digits=3)
knitr::kable(pred_reg_fit %>%
                   metrics(truth=Calories_Burned_Estimated,estimate=.pred),
             digits=3)
```
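
The metrics table reports `rmse`, `rsq`, and `mae`, computed on the same rows the model was fit on, so they describe in-sample fit. As a rough illustration of what each number measures, the sketch below computes them by hand in base R. The `truth` and `pred` vectors are toy values for illustration only, not taken from the Strava data.

```r
# Toy values for illustration only (not taken from the Strava data)
truth <- c(100, 200, 300, 400)
pred  <- c(110, 190, 310, 380)

err  <- truth - pred
rmse <- sqrt(mean(err^2))   # root mean squared error, in calories
mae  <- mean(abs(err))      # mean absolute error, in calories
rsq  <- cor(truth, pred)^2  # squared correlation, which is how yardstick defines rsq

c(rmse = rmse, mae = mae, rsq = rsq)
```

Because RMSE squares the errors before averaging, a few badly predicted long activities raise it more than they raise MAE, which is consistent with RMSE being the larger of the two in the table.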

### Predicting Activity Type

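Here is a look at a logistic regression model predicting Activity Type. Every predictor has an extremely high p-value and an enormous standard error, yet the model classifies every activity correctly (accuracy, specificity, and sensitivity are all 1). This pattern is typical of complete separation: the predictors split runs from rides perfectly, so the coefficient estimates are unstable and their p-values are not informative.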
```{r}
#Define the model specification
log_spec <- logistic_reg() %>%
             set_engine('glm') %>%
             set_mode('classification') 

#Fit the model
log_fit <- log_spec %>%
              fit(Activity_Type ~ .-Calories_Burned_Estimated, data = df)

#Capture the predictions and metrics
my_class_metrics <- metric_set(yardstick::accuracy, yardstick::specificity, yardstick::sensitivity)

pred_log_fit <- augment(log_fit, df)
knitr::kable(tidy(log_fit$fit),
             digits=3)
knitr::kable(pred_log_fit %>%
                my_class_metrics(truth=Activity_Type,estimate=.pred_class))
```
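
Perfect accuracy alongside very large standard errors is what logistic regression tends to produce when the classes can be split exactly by the predictors. The sketch below reproduces that behaviour on a toy dataset (not the Strava data) in which a single made-up predictor `x` separates the two classes completely.

```r
# Toy data (not the Strava data): x perfectly separates the two classes
toy <- data.frame(
  x = 1:10,
  y = factor(rep(c(0, 1), each = 5))
)

# glm() warns that fitted probabilities are numerically 0 or 1;
# the warnings are suppressed here only to keep the output short
fit <- suppressWarnings(glm(y ~ x, data = toy, family = binomial))

# Very large standard error and a p-value near 1 for x ...
coefs <- summary(fit)$coefficients
coefs

# ... even though every observation is classified correctly
pred <- ifelse(fitted(fit) > 0.5, 1, 0)
acc  <- mean(pred == as.integer(as.character(toy$y)))
acc
```

If the same thing is happening in the Activity Type model, the coefficient table is not a reliable guide to which predictors matter; a penalized model or a classification tree may be easier to interpret.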


Further Data Exploration {data-orientation=rows}
=======================================================================
### Interactive exploratory graphs

Row {data-height=550}
-----------------------------------------------------------------------
#### Max Heart Rate & Calories Burned Scatter Plot
Lighter colors indicate longer distances.

```{r}
# plotly is already attached in the setup chunk
# Marker symbol encodes Activity_Type; marker color encodes Distance_K
fig <- plot_ly(df, x = ~Calories_Burned_Estimated, y = ~Max_Heart_Rate,
               type = "scatter", mode = "markers",
               symbol = ~Activity_Type, symbols = c("circle", "cross"),
               color = ~Distance_K)

fig
```


#### Average Distance by Activity Type (Interactive Bar Chart)

```{r}
ggplotly(
df %>% group_by( Activity_Type) %>%
  summarize(Average_Distance=mean(Distance_K)) %>%
  ggplot(aes(y=Average_Distance, x=Activity_Type,fill=Activity_Type)) +
      geom_bar(position="dodge", stat="identity") +
      geom_text(aes(label=round(Average_Distance, 2)), position=position_dodge(width=0.9), vjust=-0.25) + #round the label to 2 decimals
      ggtitle("Average Distance by Activity Type") +
      coord_flip() #makes horizontal
)
```